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In this course, I try to provide a few basics required for performing data analysis in astero- 
seismology. First, I address how one can properly treat times series: the sampling, the filtering 
effect, the use of Fourier transform, the associated statistics. Second, I address how one can 
apply statistics for decision making and for parameter estimation either in a frequentist of a 
Bayesian framework. Last, I review how these basic principle have been applied (or not) in 
asteroseismology. 



Throughout human history, as our species has faced the frightening, terrorising fact 
that we do not know who we are, or where we are going in this ocean of chaos, it 
has been the authorities the political, the religious, the educational authorities who 
attempted to comfort us by giving us order, rules, regulations, informing - forming 
in our minds - their view of reality. 

To think for yourself you must question authority and learn how to put yourself 
in a state of vulnerable opcn-mindcdncss, chaotic, confused vulnerability to inform 
yourself. 

Timothy Leary in Sound Bites from the Counter Culture (1989) 
1. Introduction 

This paper attempts to provide a summary of the course I gave during the 25 th Canaries 
Island Winter School. In no way, this course should be perceived as the final answer to 
a problem. I hope that this course can serve as a basis for students, fellow scientists to 
go beyond what is written here. As in many approaches that I have pursued, this work 
is a snapshot of where I am and hopefully a possible starting point from which one can 
expand to other paths not yet ventured. 

This course starts with a short historical introduction on signal processing and statis- 
tics or how our forefathers started doing data analysis more than 200 years ago. The 
second part is related to the sampling and acquisition of continuous physical signals for 
subsequent analysis in a digital world. The third part contains with a broad review of 
statistics from the so-called frequentist and Bayesian points of view. The last part is 
related to the applications of the previous concept to data analysis for asteroseismology, 
which also includes a description of the physics behind that latter terms. 



2. Historical overview 

The basic principle of asteroseismic data analysis can be summarised as follows: 

• acquire signal from a finite world 

• compute the Fourier transform of the discrete signal 

f Lecture notes given at the 22 th Canary Islands Winter School, November 2010. 
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• extract the characteristics of the harmonic signals 
The first step is closely related to approximation of continuous function using decomposi- 
tion on a base of orthogonal functions. The latter could be the sine and cosine functions 
used in the second step, that is Fourier decomposition or transform. The last step is 
related to inference based on statistics or applications of probability to the data. Here- 
after, I will try to place in a historical perspective all of these steps. The perspective will 
be presented in a chronological order for each subject. This historical review does not 
pretend to be complete as I am not an epistemologist. The main goal of this historical 
review is to give the reader some keys on reflecting on some tools that we regularly use. 
The reader will then be free to delve on the subject or leave it aside. 



2.1. Spectral analysis and Digital signal processing 

The father of spectral analysis is the well known Joseph Fourier. He pioneered the decom- 
position of an arbitrary function in cosine and sine functions in his memoir on heat. In 
Chapter 3 of his memoir (p. 257). [Fburier (1822) expressed the decomposition in harmonic 
functions that later became, using Euler's notation, the complex Fourier transform. In 
the original formulation of Fourier lies explicitly the potential for a finite summation. In 
other words, any arbitrary function can be approximated with a finite summation over 
the Fourier coefficients, which is the Discrete Fourier Transform (DFT). In essence, this 
is the introduction of a representation of a continuous world using a finite set: a digital 
world. The expression provided by Fourier was already a digital description of the world. 
The use of sine and cosine functions was very much at the center of the mathematical 
world at the beginning of the 19 th century. In 1805, Carl F. Gauss devised an algorithm 
for interpolation of cosine and sine functions which would later be reco gnised as th e 
Fast Fourier Transform (FFT), work which was posthumously pu blished (Gauss 18661 ) . 
This work was publish ed in Latin but translation can be found in iGoldstinel (jlimfT and 
iHeideman et al. (Il985 ). The proposal 27 of Gauss' work is what we now call the FFT 
(jHeideman et al." 11985 ). The original algorithm was discovered (not to say re-discovered) 
by James Cooley and John Tuckey in 1965, while they were w orking at IBM and the Bell 
Telephone Laboratory, respectively ( Coolev fc Tuckevlll965l ). As anticipated by Gauss, 
they showed that the DFT could be speeded up by using the fact that the number of 
points N in the transform could be expressed as a products of prime numbers, thereby 
speeding the computation by O ( lo ^ N ^ . 

At the time of the publication of the paper bv lCoolev fc Tuckevl (|l965h . spectral analy- 
sis had already entered the modern a ge of the digital world. When working at the AT&T 
Bell Telephone Laboratory, Claude (lShannonlll949h introduced the so-called sampling 
theorem which is key for reducing a continuous finite-bandwidth signal to a digital sam - 
ple. The frequency at which the signal should be sampled was derived bv lNvquistl(|l924l ). 
hence bearing the name the Nyquist frequency (Harry Nyquist belonged to what became 
later the Bell Telephone Laboratory). The sampling theorem is at the basis of all digital 
audio equipment since the invention by Sony of the digital audio disc in 1972, later to 
become the Compact Disc of Philips in 1978. Digital signal processing (DSP) is used in 
many applications such as speech recognition, compression, audio sampling, home cinema 
and of course in scientific processing. 

It is rather surprising to realise that the existence of the digital world dates back to 
the beginning of the 19th century. In a way, it is not so strange that we need to describe 
an infinite world with a finite set of data, for instance using function interpolation. Being 
ourselves finite or limited in time and space, our world was bound to become sooner or 
later digital. 
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2.2. Probability, statistics and inference 

Probability and statistics are related to one another. Probability provides the mathemat- 
ical foundations for assessing the chance that a random event will occur. While statistics 
using probability theory provides inference on what has been really observed. Probability 
theory started with Jakob Bernoulli who was applying combinatorial analysis for calcu- 
lating probabilities related to the games of chance. He published what is known a s the 
Bernoulli distribution and also introduced the law of large numbers (Bernoullilll713 ). The 
same Bernoulli distribution was approximated bv lDe Moivre ( 17181 ) which was a special 
case of the Central Limit th eorem. Later on, R everend Thomas Bayes solved a problem 
that was left untouched by I5e Moiyrd (1 17181) rel ated to the probability of occurrence 
of unrelated events. Proposition 5 of iBavesI (| 17631 ) is w hat is known t oday as the Bayes 
theorem. This theorem was also found independently bv lLaplace (1774) who was working 
on the same subject. Unfortunately, this view on probability qui te advan c ed at the times 
of Bayes and Laplace was not used until it was re-discovered bv lJeffrevs)(ll939l) . 

Inference is related to how one can deduce from data a theoretical model of the world 
being observed (with error bars on this model). The origin of the first inferenc e can be 
trace d back to the work of Laplace, related to the use of the arithmetic mean ((Laplace 
177J). Gauss demonstrated, using the Maximum Likelihood Principle, that an estimate 
of a parameter me asured many times can indeed be expressed as an arithmetic mean of 
these obser vations ( Gauss 1809h . A typical inference called Least Square^ minimisation 
was used bv lLegendrd (|1805l ) for deriving the orbit of comets, and for verifying the length 
of the meter through the measurement of the Earth's circumference. This techni que was 
also found before Legendre by Gauss but was published until later ( Gaussl 1809t ). Gauss 
also derived w hat was called the law of errors: the so-called Gaussian distribution of errors 
( Gaussl Il809h . The use of this ubiquitous error di stribution is a simple consequence of 
the principle of theMaximum Entropy Distribution (jjavne ; the distribution simply 

reflecting the state of our knowledge (or lack of) by knowing the mean value (i and the root 
means square deviation a of a set of observations. The principles behind maximisation 
are at the very heart of inference. While Gauss introduced the concept for likelihood, 
Fisher set the proper mathematical background and theory behind the use of Maximum 
Likelihood Estimat ors, notably their asymptotic properties and their information content 
(jFisherlll912lll925h . Around that time, a controversy between Ronald Fisher and Harold 
Jeffreys marked the star t of different view on probability and statistics: frequentists vs 
Bayesian. Jeffreys ( 19391 ) was key in reviving the approach touched upon by Bayes and 
Laplace, an approach that was not the main stream of statistical thinking at the time of 
Fisher. At that point in time, the way was open for Baye sian probabilit y and statistics 
to be applied in various fields of physics and astrophysics. Uavne provides many 

possible applications of Bayesian approaches such as one used for Fourier analysis. 

Now in the 21 th century, it would be naive to believe that inference can only be based on 
Bayesian approaches, it is surely not a panacea. It should be borne in mind that as much 
as we evolve, any field of Science evolves accordingly. Even in statistics the evolution is 
not finished. The curre nt stream is t o try to reconcile the various approaches promoted 
by Fisher and Jeffreys (|Bergerlll997h . Therefore, it is the responsibility of any physicist 
or astrophysicist to follow this evolution. It is perhaps superfluous to remind the reader 
that all inferences on the world as we see it come from instrumentation, observation, 
phenomena that are neither perfect nor deterministic. Let me remind you of a quote 
by Henri Poincare: From this point of view all the sciences would only be unconscious 



t translated literally from Moindres quarres 
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Figure 1. The frequency response of a finite bandwidth signal with the boxcar on top, 
properly sampled according to Shannon theorem (Left), undersampled and aliased (Right) 



applications of the calculus of probabilities. A nd if this calcul us be condemned, then the 
whole of the sciences must also be condemned ( Poincarg 1914 ). 



3. Digital signal processing and spectral analysis 

3.1. Time series sampling 
Shannon! ( 19491 ) provided the theorem which allows to sample a continuous signal whose 



frequenci e s are contained in a finite bandwidth. Using the decomposition in Fourier series, 
Shannon ( 19491 ) wrote that: If a function x(t) contains no frequencies higher than Av , 



it is completely determined by giving its ordinates at a series of points spaced l/( 2Av ) 
apart" , then he wrote: 

x{t n )= / X{v)e l2 ™^dv (3.1) 

J-Av 

where t n = nAt (with At = 1/2A^), x(t) is the function to be sampled, X(v) is the 
Fourier transform of x(t). From this theorem, and the use of Fourier decomposition, one 
can then write: 

n— +oo 

At z(i„)e i: 



X{v) = U{2Av) 



A2irvt„ 



(3.2) 



where II is the boxcar. The term between brackets is simply the original Fourier de- 
composition which is implicitly periodic, hence the use of the boxcar for delimiting the 
frequency space. This is the case represented by the left hand side of Figure [T] Using the 
inverse Fourier transform of Eq. (|3.2p . one can shows that we can fully reconstruct x(t) 
by writing: 

n— +oo 



x{t)= x(t n )sinc I "—^ I (3.3) 

n=— oo 

where sine (=sinx/x) is the sinus cardinal function. This equation shows that one can 
recover perfectly a continuous function using samples of that function at regular spacing, 
whose cadence is provided by the spectral content of that function. In practice, the 
recovery can only be approximated as the summation over oo is impractical. 

3.2. Aliasing 

There are two cases that provide unwanted frequency leaking into the original spectrum: 
• Undersampling of a band-limited signal 



t - t 
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• Non band-limited signal 
This is what is called aliasing whose effect is shown on the right hand side of Fig. [TJ 
In this case high frequency signal leaks, back or aliases, at frequency i'aiias=i / truc — Av. 
Undersampling occurs when the sampling time is larger than Shannon's sampling time 
(l/2Av). Undersampling of a band-limited signal can be easily resolved by applying 
Shannon's theorem. The case of signal having non-limited frequency content is clearly 
not covered by Shannon's theorem. In that case, there are two techniques that can provide 
a reduction of the aliasing power: integration and weights. 

I shall focus on the effect of integration that is often encountered either when observing 
time series or making images with discrete arrays such as Charge Coupled Devices (CCD). 
When one integrates a signal x over some time (At) I can write: 



CobsW = 



1 

At 



t„+At 

x(t)dt 



(3.4) 



where x is regularly sampled at At s = t n +i — t n This equation can be rewritten using 
convolution as: 



x ohs (t) = IH(At s ) * 



i(x*n(At))(t) 



(3.5) 



where III is the Dirac comb. Since the Fourier transform of a Dirac comb is also a Dirac 
comb, the Fourier spectrum of the signal can then be written as: 

XoM = ^-IH [X(v)sinc(Atv)j™ At } (3.6) 

Figure [5] shows the resulting effect of the integration when At = At s . In practice, when 
the signal is band limited, integration will introduce a filtering effect of the high frequen- 
cies. This effect can only be reduced by having a very short integration time with respect 
to the sampling or At <C At s . The effect of aliasing is also shown on Fig. [5] When the 
signal is non-band limited, there are two possible solutions for reducing the effect of the 
high-frequency signal: 

• Introduction of a window function in frequency (the II function) 

• Integration at 100% duty cycle (At = At s ) 

The first solution requires the combination of the time sample by using the Fourier 
transform of the II function which is a sine in time. This solution can only be partially 
implemented as it would require a sum over oo. This solution is naturally implemented 
when making images through a telescope using discrete arrays. In that latter case, the 
highest spatial frequencies are cut off by having the telescope diameter providing a cut- 
off D/ A half that of the spatial pixel sampling; this is illustrated by the left hand side 
of Fig. [5J In other terms, there is no aliasing in a telescope when the pixel resolution is 
half of the telescope resolution, that is two samples per resolution element. De facto, in 
a telescope, the second solution is also used in combination with the first solution. The 
second solution, although not perfect, will reduce the amplitude of the high frequency 
noise in time series. 



3.3. Filtering 

The effect of time series filters can simply be studied by understanding the effect of 
smoothing on time series. Smoothing can be understand as being the application of a 
weighting function sliding in time: a convolution. This can be written as: 



x sm (t) = (x* w)(t) 



(3.7) 



(i 
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Figure 2. The finite bandwidth signal and the sine function related to the integration over 
100% of the sampling time (in black at v = 0), (in grey at v = ±l/2Ai/); when the signal is 
sampled according to Shannon's theorem (Left), when the signal is undersampled (Right). 



where w is the weighting function, which can be complex. Using the Fourier transform, 
this becomes: 

X sla {v)=X(v)W(v) (3.8) 

where W is the Fourier transform of w. The smoothing filter typically provides a low 
pass filter which can be used to derive a high pass filter of the original function x by 
computing: 

xm(t) = x(t) - x sm (t) (3.9) 
Using the Fourier transform, I get: 

X m (v)=X(y)(l-W(v)) (3.10) 

Figure [3] shows the results of applying one time, two times and four times the boxcar on 
a time series. Applying twice the boxcar is equivalent to a triangular weighting function 
(convolution of 2 boxcar function), while applying four times the boxcar is equivalent to 
a bell shape weighting function (convolution of 2 triangle functions). It is rather clear 
from Fig. [3] that boxca r smoo t hing should be avoided because it provides too much 
ringing effect of the tvpe lGibbd (Il898l Il899h discovered. The introduction of a less sharp 



transition for all derivatives by multiple boxcar smoothing provides a neat solution to 
this Gibbs effect. 

Another form of filtering is to use the original series shifted in time by to and then 
subtract the shifted series from the unshifted. In that case the weighting function is 
simply the Dirac distribution w(t) = 5(t — to). Then using Eq. (|3.10[) . the modulus of the 
filter is then simply 

\l-W(u)\ =2sin(7T^ ) (3.11) 

The frequency at which the transmission if half is given by teuton = l/3^o- This kind 
of smoothing filter has been used for the data of the Glob al Oscillation Network Gr oup 
(GONG) using the first difference obtained with to = At s ( Appourchaux et al. 2000l ). 



3.4. Time limits and the Discrete Fourier Transform 

The Fourier transform is not applicable when doing real data analysis. There is no way 
that we can observe an infinite strings of data. We usually observe during a finite time 
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frequency frequency 



Figure 3. (Left) Frequency response of several low pass filter: boxcar (black), triangle or 2 times 
the boxcar (grey), bell shape of 4 times the boxcar (dashed grey). (Right) Frequency response 
of several high pass filter resulting from the previous low pass filters corresponding to the same 
color coding. 



T for which we can compute the following Fourier transform: 

-+T/2 



X T {y) = / x(t)c a7!Ut &t (3.12) 

-T/2 



which can be rewritten as: 



X T (v) = [X* sinc(T)](z/) (3.13) 

The convolution function of the term in bracket is the sine function. Figure [?] shows the 
sine function for adjacent frequency spaced at It is obvious that the spectrum is 
correlated between the various frequency bins. I will show later that the correlation is 
in fact null for frequency bins separated by integer values of ^, and for slowly varying 
power spectra. If I combine, the finite observation with a finite bandwidth signal, I then 
have the Discrete Fourier Transform (DFT): 

X DFT ( Vp ) = At^ x{t n y 2 ™^ (3.14) 
n=i 

with v p = 7p, t n = nAt s , T = NAt s . Equation (|3.14p is simply the truncated original 
Fourier series, which is then by definition periodic with period of A.p = i This property 
directly gives that N is also the maximum value of p. The DFT can be computed using 
the definition given above but it is rather time consu ming as the time for comp uting the 



summation scales as TV 2 . As mentioned in Section 2.. ICoolev fc Tuckevl ( 19651 1 provided 
a faster way based on the factorisation properties of the Fourier transform; in that case 
the time for computing the summations scales as NlogN. 



3.5. Fourier transform of stationary processes 

The application of Fourier transforms to non deterministic or random functions is key in 
doing data analysis in astrophysics. For such a process, I define for the random variable 
x, the power spectral density as: 



St(u) = Ux T {v)f 



(3.15) 
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Figure 4. The sine function as a function of frequency normalised to the resolution (1/T), for 

v = (black), for v = ±1 (grey) 



where Xt(v) is given by Eq. (|3.12p . I can then write St{v) as: 



St{v) = - 



+T/2 p+T/2 



T/2 J-T/2 



x(t)x(t')e 



dtdt' 



(3.16) 



Since the process is stationary I can make the change of variable r = t — t', and then I 
have: 



St{v) = 



+T/2 



T/2 



+T/2 



T/2 



x(t')x(t' + r)dt' 



r dr 



(3.17) 



The term in brackets is by definition the autocorrelation function (Ct(t)) of the process 
taken over a finite window T. Then I have: 

-+T/2 



S T {v) 



T/2 



C T (r)e L 



r dr 



(3.18) 



Equation (|3.18p introduces the so-called Wiener-Khinchin theorem which is essential for 
understanding the spectral analysis of random processes. As a matter of fact, if I also 
assume ergodicity of the process (i.e. the temporal average can be exchanged with the 
spatial average) then I have: 



CM = E(x(t)x(t + t)) = lim C t (t) 

T— ^oo 



I can then write: 



S x (v)= lim E [5 r (i/)] 

T— >oo 



C(r)e 



dr 



(3.19) 



(3.20) 



which is the proper Wiener-Khinchin theorem ( Wieneijll930l : Khinchin 1934 ). Using the 
inverse Fourier transform, I can then also write: 



C(t) 



S x (u)e~ 



'dv 



(3.21) 



This relation between the auto-correlation function and the power spectral density is 
absolutely key in understanding the Fourier analysis of stationary processes. 
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3.6. Parseval's theorem 

For t = 0, I also have 

2 



C(0)= / S^^chv = E(ar) (3.22) 

^ — oo 

This equation is also a n alternate formulation of t he energy conservation principle known 
as Parseval's theorem (Parseval des Cheneslll806h . This important property provided by 



Eq. ([3T2"2"|) can be adapted for normalising the DFT. For the DFT given by Eq. (|3~T4"j) . I 

can write: 

p=N n=N 
p— 1 n— 1 

where a is the required normalisation factor used for the spectral density. For Eq. (|3.14p . 
it is easy to show that the normalization factor a is 1/iV which is the usual factor used 
when computing the DFT. If other definitions of the DFT are to be used, the proper 
normalisation factor a can be derived with Eq. (|3.23|) . This latter equation is used for 
calibrating the spectra coming from different routines, all having different normalization 
factors. 

3.7. Statistics of the Discrete Fourier Transform 

Equation (j3.14p is our bread and butter for extracting the frequencies associated with 
periodic phenomenon observed, for instance, in stars. Whenever one carries out obser- 
vations, one should not forget that the observations come with noise either associated 
with the observed phenomenon or with the instrument providing the data. Therefore, 
it is essential to understand the statistics of the Fourier transform of random variables. 
Let us start with a simple and commonly used example: a random variable x with an 
unknown distribution (with E(x) = and E(x 2 ) = a 2 ) for which all time samples x(t n ) 
are independent from each other, in addition the process is assumed to be stationary. 
Using Eq. (|3.14| . I can then write the DFT of the x(t n ) as: 

n=N n—N 

X rjFT (f p ) -(- iX^ FT (i/ p ) = ^2 x(t n ) cos(2irv p t n ) + i 2J x{t n ) sin(27w p t n ) (3.24) 

n— 1 n— 1 

where X^ FT and X^ FT are the real and imaginary part of the Fourier spectrum. Since 
the x(t n ) are independent and identically distributed (i.i.d.), by virtue of the Central 
Limit Theorem, the statistical distribution of Xf) FT and X^ FT is a normal distribution 
for N > 1 with: 

E(^ FT (^)) = E(J^ PT (^)) = (3.25) 

E([^ FT (^)] 2 ) - E([X rjFT (^ p )] 2 ) = ^a 2 (3.26) 

Then, since X^ FT and A"q FT are independent and have the same normal distribution, 
the statistics of the power spectrum is then by definition a \ 2 with 2 degrees of freedom 
(d.o.f.). 

Unfortunately (or fortunately) none of the processes that we observe have the prop- 
erties of being i.i.d. The processes are usually stationary processes but not i.i.d. because 
these processes have usually a memory such that the correlation of x(t n ) and x(t m ) 
are different from zero when t n ^ t m . Nevertheless, in that case, it can be demonstrated 
that the components of the Fourier transform are also both normally dist ributed with the 



same mean of zero and the same variance, which depends upon frequency ([Peligrad fc Wu 
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2010). It is amusing to quote Peligrad and Wu on their finding: l In this sense Theorem 



2.1 [of IPeligrad fc Wul (|2010h ] justifies the folklore in the spectral domain analysis of 



time series: the Fourier transforms of stationary processes are asymptotically indepen- 
dent Gaussian." 

3.8. Time series sampled at unevenly times 

It is quite common in astrophysics to have samples that are not equally spaced in time. 
This lack of uniformity could be due to a variable number of photon per seconds or due 
to variable detector read out ti me. Usually, thi s can be avoided by carefully designing 



the electronics, see for instance iFrohlich et al.l ( 1997 ). Nevertheless, if the time series 



are unevenly sampled, there are ways and means to find solutions. Equation f|3 . 14[) can 
still be applied, obviously by dropping At s . The problem of the latter equation is that 
for unevenly times the statistics of the Fourier spectrum is no longer \ 2 with 2 d.o.f. 
( Scargld 1982 ). Equation (|3.24j) can be adapted such that this statistical property is kept 



by writing: 



^ls i v v) = —T\ cos(2irv p (t n - t)) (3.27) 

W(T) ^— ' 



x is (»p) = ttV x(tn) sin(27w p (t„ - r)) (3.28) 



where w and v are given by: 

;(r) = ^2 cos 2 (27Tj/ p (i„ - r)) (3.29) 



It ' 



n=l 



n=N 

v(t) = sm 2 (2iris p (t n - r)) (3.30) 

n=l 

and t is introduced for keeping the invariance in time of the transform given by Eqs. (|3.27p 
and (|3.28p . r is given by: 

22n=l C0S(27Ti4 n ) 

The definition provided by Eqs. p.27p and (|3 . 28[) has the benefit of giving a power 
spectrum or Lomb-Scargle (LS) periodogram ([X^g (v p )] 2 + [X^ s {v p )\ ) which is \ 2 
with 2 d.o.f. (|Scarglelll982r) . It must also be pointed out that Eqs. (|3.27[) and (]3 . 28[) are 
also the solution obtained when applying Least Squares minimisation to 

n—N 

[x(t n ) — a c cos(27r^i) — a s sin(27wi)] 2 (3.32) 

71=1 

where a c and a s are given by Eqs. (|3.27p and (|3.28p . respectively. For s peed, the LS pe- 



riodog ram is usually computed using the implementation prescribed by iPress fc Rvbicki 
( 19891 ) which is an approximation of the LS periodogram based upon extirpolatioi^ 



a regular mesh and the use of the FFT. The prescription is then very close to interpo- 
lating onto a regular mesh. It is worth noting that most users of the LS periodogram for 



| Reverse interpolation or extirpolation replaces a function value at any arbitrary point by 
several function values on a regular mesh 
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unevenly sampled data are in fact computing the FFT of the original data r esampled 
onto a regular mesh, but with a proper normalization as given bv lScargld (|l982h . It must 
be noted that the LS periodogram does not provide a better solution to coping with the 
presence of gaps. The reason is that although the Fourier transform explicitly includes 
gaps as zeros, adding zeros is also implicitly performed with the LS periodogram. As a 
consequence, correlations between frequency bins also exist with the LS periodogram, 
but these are generally ignored. The correlations in the Fourier transform in the presence 
of gaps are addressed in the next section. 

3.9. The influence of gaps in the time series 



The impact of the gaps on the Fourier spectrum has been described by iGabriell (|1994f ) . 
The gaps introduce correlation between frequency bins that need to be taken into ac- 
count when one wan ts, for example, to fit the power spectrum (See for applications 
Stahn fc Gizonl [20081) . Hereafter, I will provide the result regarding the correlation of 
the Fourier spectrum between the frequency bins. Assuming that I observe, a random 
variable x through a window W, the Fourier transform can be written as: 



X(v) = I x(t)W{t)c l27UJt At 
The mean correlation between two frequency bins V\ and v 2 is given by: 



E[X{ Vl )X*{v 2 )\ =E 



x^xit^w^wit'y 2 ^^ 2 ^' dm' 



(3.33) 



(3.34) 



where * denotes the complex conjugate. Following Gabriel (1993), I have the following 
properties for the real and imaginary parts of X: 



EiXi^x;^)} = E[#i(zA)^> 2 )] 

E[* r (i/i)^> 2 )] - -E[*i(n)#(i*)] 
Using the Fourier transform of x(t), I can rewrite Eq. ()3.34p : 



(3.35) 
(3.36) 



E[X(v l )X*(v 2 )]= I j j j E[A(^A(^)]^(t)VF(t')c i27r[( " 1_!y ' )t_(!y2_ " )t ' 1 dtdt'd^' 

(3.37) 

By construction, I assume that there is no correlation between the real and imaginary 
parts of the original spectrum X, that their variances have the same valu e E(A, 2 (^)1 an d 
the frequency bins of the original spectrum are not correlated (See also. lGabrielHl993T ). 
such that I have: 

E[X r (i/)X r (i/)] = E[X?(v)]5(i/ - v) 

E[X t {v)X t {v')\ = E[Xi(y)Xi(i/)] 
where S is the Dirac distribution. Then I can rewrite: 



(3.38) 
(3.39) 



E[X{ Vl )X*{v 2 )] 



E[X 2 (v)}[W(vi - v)W(y - v % )\dv 



(3.40) 



where W is the Fourier transform of the window function w. For understanding the impact 
of Eq. (|3.40p , let us assume that we observe white noise of mean and of variance ctq m 
frequency. In that case, I have: 



E[X(iy 1 )X*(u 2 )]=2al 



+oo 



W{y x - v)W{v - v 2 )Av 



(3.41) 
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Using the properties of convolution and the inverse Fourier transform, it can be showiU 
that I have: 

/>+00 

E[X{v!)X*{va)] = 2a 2 / uf i (t)fp*< vi - , ») t dv (3.42) 



The integral in this equation is simply the Fourier transform of the square of the window 
function, or W scl . For a window function such as the one provided by an observing window 
of length T (See Eq. I3.12p , the correlation is then given by the sine function. This justifies 
a posteriori the sampling at frequency interval of 1/T, for which the correlation is null. 
If the variations of E[X r 2 (i/)] are slow with respect to W(v), I can also rewrite Eq. (|3.40p 
as: 

E[X{ Vl )X*{ V2 )] » 2E[X r > 1 )]lU s > 1 - v 2 ) (3.43) 

where W sq is the Fourier transform of w 2 . Of course this approximation does not hold 
when the variations in frequency are over scales of 1/T. Nevertheless, Eq. (|3.43[) gives an 
interesting solution for understanding the correlation between frequency bins in a Fourier 
spectrum. 

It is also useful to understand the correlation between the real and imaginary parts 
of the Fourier transform. Using Eqs. (|3.35p and (|3.36[) . I can derive the very useful 
formulation: 

/+00 
E[X*(y)]H [W{ut - u)W(y - v 2 )} dv (3.44) 
-OO 

/ + OO 
E[X 2 {v)]l [W(vt - v)W(v - v 2 )\ Av (3.45) 
-00 

where 1Z and X denote the real and imaginary operators, respectively. These two relations 
can be used when a specific window which is different from the boxcar is used. For 
example, the use of weights across the observing window will then introduce correlation 
provided by Eqs. Q3.44p and Q3.45p . The introduction of these weights or tapers is described 
in the next section. 

3.10. Taper estimates of Fourier power spectrum 

Fourier spectrum estimation is well adapted for periodic signals (pure sine waves or 
stochastic waves) but not necessarily well suited for estimating the spectral density of 
frequency-dependent noise (pink or red noise). For that purpose, one can: 

• average the power spectrum over an ensemble of n sub-series, 

• smooth the power spectra over n frequency bins or, 

• use multitapered spectra using the full time series for deriving a similar average. 
Fourier spectrum estimation can be replaced by multitapered spectra that are widely 

used in geophysics (for a review see lThomsonlll982l) . Multitapered spectra are generated 
by applying a set of tapers to a single time series, and an estimate of t he mean powe r 



spectrum is derived from an average of these spectra. Using tapers due to lSlepianl (]1978I ). 
the multitapered spectra are statistically independent from one another, and the statis- 
tics of the mean spectrum follows a y 2 dist ribution with 2n degrees of freedom (where 
n is the number of tapers, I Thomson1 Tl9 8 2) . While the statistics of the average power 
spectrum (or smoothed power spectrum) also follow a \ 2 distribution with 2n degrees of 
freedom: the resolution of the average spectrum is n times lower than that of the mul- 
titapered spectrum. In helioseismology the use of these slepian taper s has been replaced 
by more practical (but less accurate) sine tapers ( Komm et al. 1999). Unfortunately, for 



f I leave the demonstration to the reader 
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sine waves, tapers tend to broaden the peaks, as shown bv I Thomson! (1982). Tapers as 
such provide more benefit for broader peaks than for narrower peaks. 



4. Data analysis and statistics 

4.1. Hypothesis testing 

Statistical testing is essential when one wants to decide: have we found a signal or not! 
This is related to decision theory, which can be summarised as how do we choose between 
one hypothesis versus another in the presence of uncertainties? In this area, there are 
two schools of thought: the frequentist school and the Bayesian school. 

The difference between a Bayesian and a frequentist relates to their views of subjective 
versus objective probabilities. A frequentist thinks that the laws of physics are determin- 
istic, while a Bayesian ascribes a belief that the laws of p hysics are t r ue or operational. 
The subjective approach to probability was first coined bv lDe Finettil (|l937t) . 

For the rest of us, the difference in views between frequentists and Bayesians can be 
outlined by taking an example from the Six-nation rugby tournament. For a frequentist, 
France has been winning over England in their direct confrontation only 39% of these 
matches since 1906. Based on this result, for a frequentist, France has only 39% chance 
of winning any future game against England. For a Bayesian, this is a complete different 
story. A Bayesian may attribute higher chance or lower chance to France to win a given 
game based on the current physical and technical skills of each player, on the current 
ability of the players to play as a team and on the psychological and mental health of 
the players and of the team as a whole. Based on this assessment, a Bayesian might have 
attributed 70% chance to win the 2010 game, or 20% chance to win the 2011 game. This 
is an a posteriori evaluation of the chances as both games took place while writing this 
course. It is used as an example. 

In short, frequentists assign probability to measurable events that can be measured 
an infinite number of times, while Bayesians assig n probabilit y to events that cannot be 
measured, like the survival time of the human race (|Gottlll994l ) or like the future outcome 
of sporting matches. 

In what follows, I will try to give an overview of what / believe I know on a subject 
that is rapidly evolving; and what I write is certainly not gospel. 



4.1.1. Frequentist hypothesis testing 

For a frequentist, statistical testing is related to hypothesis testing. In short, we have 
two types of hypotheses: 

• Ho hypothesis or null hypothesis: what has been observed is pure noise 

• Hi hypothesis or alternative hypothesis: what has been observed is a signal 

For the Ho hypothesis, I assume known statistics for the random variable Y observed as 
y and assumed to be pure noise; and then set a false alarm probability that defines the 
acceptance or rejection of the hypothesis. The so-called detection significance (or p-value, 
terms not widely used in astrophysics) is the probability of having a value as extreme 
as the one actually observed. There is an on-going confusion because statisticians call 
the significance level what astronomers call the false alarm probability; and statisticians 
call the p-value what is set in astronomy as the detection significance (which is not the 
significance level). Here I shall use the current vocabulary understood in astronomy. For 
example, the false alarm probability p for the Hq hypothesis is defined as: 



p = P Q (f(Y) > f(y c )), 



(4.1) 
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Figure 5. Detection probability as a function of the false alarm probability for pure sine waves 
stochastically excited for various signal-to-noise ratio: 1 (continuous line), 2 (dashed line), 10 
(dot-dashed line) 

where T is the statistical test, and P$ is the probability of having T(Y) ^ T(y c ) when 
Ho is true; and y c is the cut-off threshold derived from the test T and the value p. For 
example, take the case of a random variable Y distributed with % 2 , 2 degrees of freedom 
(d.o.f) statistics, having a mean of a. If I further assume that T(Y) = Y, I then have 
that: 

P = Po(Y ^y c ) = c~^ (4.2) 

If one observes a value y of the random variable Y that is larger than y c , the Ho hypothesis 
is rejected. The value that is quoted in this case is the detection significance D, i.e., 

V = e"- (4.3) 



The H n hypothesis was used by IScargld (| 1982T) for setting a false alarm probability, 



and by Appourchaux et al. ( 2000t) to impose an upper limit on g-mode amplitudes. The 



method was based on the knowledge of the statistical distribution of the power spectrum 
of full-disc asteroseimic instruments, namely the x 2 distribution with 2 d.o.f. For the Hi 
hypothesis, I assume given statistics both for the noise and for the signal that we wish 
to detect, and set a level that defines the acceptance or rejection of that hypothesis. 

In this example, I took as given that the test T was known (T(Y) = Y). As a matter of 
fact, such a test is not obtained in an ad hoc manner but can be rationally derived using 
the Neyman-Pearson lemma. An example of the application of this lemma for deriving 
the test and level provided by Eq. (|4.2[) is explained in the next section. 
The Neyman-Pearson lemma. The derivation of the best test and of the levels as- 
sociated with the Hp and Hi hypotheses is provided by the Neyman-Pearson lemma 



( Nevman fc Pearson 19331) . This lemma is very useful for designing tests that will max- 



imise signal detection while minimising noise effects. 

Lemma 1. 3n > such that A(y) = §?|l§4 «S V where P(A(y) ^ rj\H. ) = a 

where L(y\Ho) and L(y\Hi) are the likelihood for each hypothesis and a is called the 
power of the test. I will show later how one can use such a lemma for a specific case 
that is often encountered in astronomy: the detection of single frequency peak in a power 
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Table 1. Types of error obtained for different decisions, based upon the statistical test 
performed, and how the error relates to the status of the Ho hypothesis. 





Status of Ho 




True False 


Reject 


Type I Correct 


Decision 




Accept 


Correct Type II 



spectrum of a star having eigenmodes with very long lifetimes. Let us assume that we 
observe pure noise in a power spectrum, since the statistics is \ 2 with 2 d.o.f. I can write 
the likelihood of observing a value y as: 

nmo) = ^o-y/ B (4.4) 

where B is the mean noise level in the power spectrum. Next, I assume that the peak is 
not deterministic but that its amplitude is stochastic with an amplitude A. The likelihood 
for Hi is then: 

Lmi) = rJTA c ^ /(B+A) (4-5) 

The likelihood ratio then can be written as: 

A(y) = (1 + H)e~$* (4.6) 
with H = A/B. Then applying the Neyman-Pearson lemma leads to: 

A(») ^v^y>v' (4-7) 

where rf is given by solving 

P(y > V|Ho) = e" * = a (4.8) 



which justifies a posteriori the use of Eq. (|4.2[) . Then I can also write the detection 
probability of a sine waves as: 

P(y ^ t/|Hi) = e"^-ff = QfTTTf (4.9) 

Figure [5] shows the result for the detection probability for sine waves stochastically ex- 
cited. Such a diagram is also called the receiver operating characteristic (roc) . It provides 
a very efficient way of assessing the performance of the statistical test used. In summary, 
the Neyman-Pearson lemma can be used for deriving in a non-arbitrary fashion the best 
test for accepting/rejecting Ho. With this lemma, the design of a test is therefore more 
systematic and less prone to improvisation. 

Is the world dichotomic? Taking a decision based on the result given by a single test, 
for either hypothesis, could lead to errors in the decision process. For instance, the null 
hypothesis could be wrongly rejected when it is true (false positive or wrong detection), 
but could also be wrongly accepted while it is false (false negative or no detection in 
presence of a signal). The false positive results in a Type / error, while the false negative 
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results in a Type II error (see Table [TJ. The ideal case would be, using the Neyman- 
Pearson lemma, to set a test that would minimise the occurrence of both types of errors 
It has been cus tomary when applying the Ho hypothesis to set the decision level ar- 
bitrarily at 10% ( Appourchaux et al. 20001 ). From the frequentist view point there is 



nothing wrong in setting a priori the decision level before the test is applied. There are 
three types of result we might obtain from applying the test: 

(a) Ho always rejected 

(b) H rejected or accepted at a level very close to 10% 

(c) H always accepted 

Decision (a) will lead to the mention of a detection being statistically significant at a 
level provided by the detection significance (for example from Eq. (|4. 3[) 1 . The question is 
then to know what was detected. The next step would then be the application of a test 
for the Hi hypothesis taking into account assumptions about the detected signal, which 
may very likely result in the detection of signal. Decision (c) seems straightforward, i.e., 
noise dominates, but might one then be tempted to lower, a posteriori, the decision level? 
Decision (b) is the more difficult borderline case, forcing us to either accept or reject Ho- 
Here, we might ask: are things really that clear cut? What are the chances that if we 
accept Ho it is actually wrong (Type II error), or truly right if rejected (Type I error)? 

These potential actions result from the application of a frequentist test trying to answer 
the following question: what is the likelihood of the observed data set y, given that Ho 
is true or p(y|Ho)? The detection significance mentioned when the test rejects the Ho 
hypothesis is nothing but p(y|Ho), when actually what we want to know is the likelihood 
that Ho is true given the data, i.e., p(Ho|y) (^ p(y|Ho)). The frequentist view does provide 
a useful answer when one can repeat the observations ad infinitum. But when we have 
only one universe, one observation, another approach must be used based upon Bayes' 
theorem; an approach which in principle gives access directly to p(Ho|y). 

4.1.2. Bayesian hypothesis testing 

On the posterior probability. We should never forget the two sides of the coin: if 
probability (likelihood) can justify alone the rejection or acceptance of an hypothesis, 
this probability is not the significance that the hypothesis is rejected or accepted. The 
decision levels discussed above are related directly to a well-known controversy in the 
medical field, concerning improper us e of Fisher's p- val ues as measures of the probability 
of effectiveness of a medicine or drug ( Sellke et al.ll2001 ). The detection significance (or p- 



value) is improperly used as the significance of the evidence against the null hypothesis. It 
is far from trivial at first sight to understand what is wrong with the detection significance. 
Let us recall the example I gave above for a random variable Y having a \ 2 with 2 d.o.f 
statistics. In that case the detection significance is given as: 

V = c~i ^P (Y >jf). (4.10) 

The latter statement (^) is fundamental. The observation is performed only once provid- 
ing a value of y and hence the detection significance T>. But in no way does it provide the 
probability that the random variable is always above y (or Pq(Y ^ y)). It is not correct to 
assume that if the observation were repeated it would provide the same level y. The mis- 
take is to ascribe a significance to a measurement performed only once, i.e., not repeated, 
and spanning just a very small volume of the parameter space (e.g. Y £ [y,y + 5y\). If 
one makes a measurement y of the random variable Y that is above y c , the significance 
of that measurement is not e~*' cr . In the framework of Bayesian statistics, we are not 
interested in the detection significance but in the posterior probability of the hypothesis 
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p(Ho|y), in other words as already stated above jp(Hp|j/) ^ jp(j/|Hp). A simil ar description 
of this misunderstanding has been presented bv ISturrock fc Scargle ( 20091 ) 



In order to derive th e poste r ior pr obability p(Ho|y), let us first recall the Bayes' the- 



orem. The theorem of iBavesI (j 17631 ) relates the probability of an event A given the 
occurrence of an event B to the probability of the event B given the occurrence of the 
event A, and the probability of occurrence of the events A and B alone. 

P (A|B ) = «<Ai (4.U) 

For example, the probability of having rain given the presence of clouds is related to 
the probability of having clouds given the presence of rain by Eq (|4. 1 1 [) . The term prior 
probability is given to P(A) (probability of having rain in general). The term likelihood 
is given to P(B|A) (probability of having clouds given the presence of rain) . The term 
posterior probability is given to P(A|B) (probability of having rain given the presence of 
clouds). The term normalization constant is given to -P(B) (probability of having clouds 
in general). 

The posterior probability of a hypothesis H, given the data D and all other prior 
information I, is stated as: 

W,« = «f. (4.12) 

where P(H|I) is the prior probability of H given I, or otherwise known as the prior; P(D|I) 
is the probability of the data given I, which is usually taken as a normalising constant; 
P(D|H, I) is the direct probability (or likelihood) of obtaining the data given H and I. 
Berger fc Sellkd (Il987l) obtained, using Bayes' theorem, p(Ho\y) with respect to p(y\Ho) 



and p(y|Hi), where Hi is the alternative hypothesis. 

/tt p(Ho)p(y|H ) 

I set po = p(Ho), and since we have p(Hi) = 1 — po, they finally obtained: 

p(H |y)= (l + ^—BlA , (4.14) 
V Pa J 

with C being the likelihood ratio defined as: 

r p(y|Hi) , . 

L- = /-ITT \ * (4-15) 

Here, p(Ho\y) is the so-called posterior probability of Ho given the observed data y. 
Naturally there is no way to favour Hg over Hi , or vice versa, otherwise our own prejudice 



would most likely be confirmed by the test, i.e. po = 0.5. Subsequently, iBerger et al 



( 19971) recommended to report the following when performing hypothesis testing 



if C > 1, reject Ho and report p(Ho\y) = — , (4-16) 

if C ^ 1, accept H and report p(Hi|y) = -— j . (4-17) 

The advantage of such a presentation is that even for a borderline case, say when the 
ratios above are close to unity, it is clear that there is only a 50 % chance that the Ho 
hypothesis is wrongly accepted, or wrongly rejected. This presentation is more honest 
and better encapsulates human judgement and prejudice. 
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Figure 6. On the left-hand side, likelihood ratio C as a function of the mode amplitude for 
detection significances of 10% (solid line), and of 1 % (dashed line); the noise is set to unity. 
On the right-hand side, the posterior probability of Ho as a function of mode amplitude for 
detection significances of 10% (solid line), and of 1 % (dashed line) (from Eg. 14.33]) ]. 



Example of posterior probability. Using the example given in Section [4. 1.11 for the 

detection of sine waves, we can derive p(Ho|y) using Eq. (|4.6p as: 

p(H |y)= { 1 + t^p' h/{1+h) ) ( 4 - 18 ) 

where p = e~^^ B is the detection significance. Figure [5] show the results for two different 
detection significances. When the detection significance is 10 %, the likelihood ratio can be 
greater than unity for large values of the mode amplitude, leading to the acceptance of the 
null hypothesis. This is rather paradoxical, i.e., that large mode amplitude can lead to the 
rejection of the alternative hypothesis. To resolve the paradox we note that the posterior 
probability of Ho is in any case never lower than 40%, or the posterior probability of Hi 
is never higher than 60%. This implies that both hypotheses are equally likely when the 
detection significance is as low as 10 %. In other words, when we set, a priori, a large mode 
amplitude and get a low detection significance, the alternative hypothesis is as likely as 
the null hypothesis. In other words, the assumption about a large mode amplitude is not 
supported by the data. 

The main conclusion to be drawn from this calculation is that the detection significance 
should be set much lower than 10 % in order to avoid misinterpretation of the result. For 
example, with a detection significance of 1 %, the p osterior probability for Ho can fall to 
10% when the signal-to- noise ratio is above unity. Sellke et al.l ( 200ll ) showed that the 



posterior probability can never be lower than the lower bound: 



p(H \x)> M-__ (4.19) 
\ ep In p J 

The reader may verify for themselves that this lower bound is effectively reached for 
Eq. (|4.18[) . In the case, when the amplitude of the mode A is not known, one needs to 
set, a priori, the value for the likely range of amplitudes. In the case of a uniform prior, 
the posterior probability p(Hp|g;) then does reach a minimum that is higher than the 
lower bound of Eq. (|4.19p ( Appourchaux et al. 2009f) . 



In summary, the significance level should not be used for jus tifying a dete c tion ( or a 
non-detection). Instead I recommend using the prescription of iBerger et al. (1992), as 



given by Eqs. (|4. 16|1 and (|4.17jl and to specify the alternative hypothesis Hi. 
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On the choice of the prior probability One important question when applying 
Bayesian statistics is what value should the prior probability of the hypothesis Ho, i.e., 
Po, take? We define the prior probability as the probability that the Ho hypothesis is 
correct. The probability that the alternative Hi hypothesis is correct can then be defined 
as p(Hi) = 1 — po- It is common to set po = 0.5 so as to avoid prejudicing one hypothesis 
over the other. Would we expect the probability that Hi and H are true to be the same 
in all instances? Since Bayesian statistics requires a priori knowledge, it is possible to 
use our knowledge of physics/astrophysics to tell us which hypothesis is more likely to 
be true in a given circumstance. 

4.2. Parameter estimation 

The previous section on hypothesis testing is really the prerequisite when one wants to 
assess if there is a signal sought in the observation. Unfortunately, knowing that a signal 
is present does not provide any pertinent information for doing physics or astrophysics. 
This is the goal of parameter estimation. 

Parameter estimation is a vast subject in statistics. Below, I will introduce the esti- 
mations that are the most commonly used in astrophysics. The estimations described 
hereafter are also related to the frequentist and Bayesian world. As we will see, these 
estimations are not so foreign from each other. The frequentist estimation can be used 
when the signal-to-noise ratio is high, while the Bayesian estimation is more useful (but 
more time consuming) when the the signal-to-noise ratio is low. 

4.2.1. Maximum Likelihood Estimation 

As shown in the Historical Overview section, iGausd dl809h introduced the concept of 



Maximum Likelihood Estimation or MLE. The aim of MLE is to find the set of parameters 
that maximise the likelihood of the observed event, this is a point-like estimation. Having 
observed a random variable x with a probability distribution p(x, A), where A is a vector 
of p parameters describing the model behind the random variable x, the likelihood L of 
N observations of x is given by: 

iV 

L(x t N t X) = Hf{x k ,\). (4.20) 
fc=i 

where the product implicitly expresses the fact that all Xk are independent of each other. 
Usually we define the logarithmic likelihood function £ as 

N 

£(x, N, A) = \nL(x, A) = - ^ lnf(x k , A). (4.21) 

k=l 

The estimate of A is derived from the maximisation of the likelihood as given by Eq. (|4.2ip 
such that we have 

A = m&x£(x, N, A) (4.22) 

Such an estimator has several interesting properties in the limit of very large sample 
(N —> oo) which are: 

• MLE are asymptotically unbiased, 

• MLE are of minimum variance, 

• MLE are asymptotically normal. 
The first property implies that: 

lim E(X) = A . (4.23) 

JV— >oo 
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where Ao is the true value of the model. The second property implies that no other 
asymptotically unbiased estimator has lower variance. Using the Cramer- Rao theorem 
( Cramer|[l946l : iRaql F" 



19451 ). this can be rewritten as: 

1 



'(Ao) 



where /(A) is the Fisher information matrix whose elements are given by: 

~d 2 £(x,\y 



Asymptotically we also have: 



Iij - E 



lim cov 

JV-s-oo 



1 



(4.24) 



(4.25) 



(4.26) 



Finally the third property regarding the estimator being asymptotically normal, can be 
expressed as: 



(4.27) 



where M is the normal distribution. This latter equation is usually used for providing 
the statistical distribution of A as: 



p(A)«JV A, 



1 



H{\) 



where H is the so-called Hessian matrix whose elements are derived from: 

~d 2 £(x,\) 



dXidXi 



with the property given by the Cramer-Rao theorem as: 



1 



— > 



1 



if (A) /(Ao) 



(4.28) 



(4.29) 



(4.30) 



Equation (|4.29[) is used when computing the so-called formal error bars on A; as a matter 
of fact according to the Cramer-Rao theorem, Eq. (|4.29p gives only a lower bound to the 
error bars. 

Significance of estimates. When one uses Least Squares for fittin g data, one ca n test 
the significance of its fitted parameters using the so-called R test (|Friedenl[T98l . For 
MLE, a useful test can be used: the likelihood ratio test. This method requires maximising 
first the likelihood cT^p) of a given event where p parameters are used to described the 
statistical model of the event. Then if one wants to describe the same event with n 
additional parameters, the likelihood e" £ ( n p+"' will be maximised. The likelihood ratio 
test consists in making the ratio of the two likelihood. Using the logarithmic likelihood, 
we can define the ratio A as: 



ln(A) = £(n p+n ) - £{w p ) 



(4.31) 



If A is close to 1, it means that there is no improvement in the maximised likelihood and 
that the additional parameters are not significant. On the other hand, if A ^ 1, it means 
that £(tt p + n ) <C £(uj p ) and that the additional parameters are very significant. In order 
to define a significance for the n additional parameters, we need to know the statistics of 
ln(A) under the null hypothesis, i.e. when the n additional parameters are not needed to 
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describe the model. For this null hypothesis, Wilks (|l938l ) showed that for large sample 



size the distribution of — 21nA tends to the x 2 (n) distribution. 
Calibration of error bars. Before applying MLE to real data, it is always advisable 
to test the power of this approach on synthetic data, i.e. performing Monte-Carlo sim- 
ulations. They are not merely for playing games; these simulations are real tools for 
understanding what we fit and how we fit it. Assuming that the statistics of the data is 
known, performing Monte-Carlo is useful for the following reasons: 

• Assessing the model of the data 

• Assessing the statistical distribution of the parameters of the model 

• Assessing the precision on the fitted parameters of the models 

The parameters derived by the MLE should have the desirable properties of having a 
normal distribution (See above); if not we advise to apply a change of variable on the fitted 
parameters (log a; for instance). A normal distribution is necessary to derive meaningful 
error bars, this is the assumption behind Eq. (j4.29[) . In order to be able to derive a good 
estimate of the error bars using one realisation, the standard deviation of a large sample 
of fitted parameters should be equal to the mean of formal errors return by the fit, i.e. 
this is the approximation of Eq. (|4.27|) by Eq. (|4.28p . In other words we should have the 
following approximation for the inverse of the covariance of the parameters: 

H{\) « 7(Ao) (4.32) 

where H is related to the formal error bars and / is related to the asymptotic error bars. 
This calibration as expressed in this equation is key to derive meaningful error bars. I 
advise the reader to use such a calibration procedure for checking the formal error bars 
derived from software codes fitting function using Least Squares. The formal error bars 
derived by a single noise realisation are only a lower limit to the real error bars. This 
lower limit is never reached when for instance the signal-to-noise ratio is too low. In this 
latter case, we are very far from the asymptotic behaviour. This case is covered by the 
Bayesian approach to parameter estimation. 

4.2.2. Bayesian parameter estimation 

Parameter estimation can also be done using Bayes' theorem. In this case, this is the 
so-called Bayesian inference. Using the same notation as before, I can express using 
Bayes' theorem the probability distribution as: 

f\\ n p(A|I)p(x|A,I) 

p(X\x,I) = — (4.33) 

p{x\l) 

where A are the observables for which I seek the posterior probability, x is the observed 
data set, and I is the information. The prior probability of the observables is given by 
p(A|I): this is the way to quantify our belief about what I seek. The likelihood is given 
by p(x\X, I) which is exactly the L(x, A) of the previous section. Therefore the frequentist 
approach is related to the Bayesian approach simply by the frequentist likelihood, the 
prior probability and the normalization factor p(a;|I). The main advantage of the Bayesian 
approach is that the posterior probability p{X\x, I) is directly accessible while for the 
frequentist approach only the location of the maximum of the likelihood is known. In 
this latter approach, there is no direct visibility of the parameter probability distribution 
but only an approximation provided by Eq. Q4.28P • This is why the frequentist approach 
is a point-like estimation whereas the Bayesian approach is more global. For instance, 
the power of the Bayesian approach is such that it provides the full posterior probability 
which may not be necessarily a normal distribution but could be the sum of many normal 
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distribution due to many local minima. In that case, only the posterior probability can 
provide a correct assessment of the statistics of the derived parameters. 
Posterior probability estimation The main difficulty in Bayesian inference is to derive 
the posterior probability. If the derivation of Eq. (|4.33p is analytical then parameter 
estimation can easily be done (See an example in the Application to Asteroseismology 
section). When this is not possible, the easiest is to compute the posterior probability 
by using a random walk algorithm that will provide the posterior probability from a 
representative samples of the A. A famous examp le of such a procedure is derived from th e 
so-called Metropolis- Hastings algorithm (MH) ( Metropolis et all Il953 ; Hastings! Il970f ). 



Let us see how this algorithm works in practice for a probability distribution p(X) for 
which we want to have a representative sample. We start from a given point and from 
a known probability distribution Q. We draw at random from the probability distribution 
a value A' knowing A^ and we compute the following ratio: 

p(A')Q(AM|A') (4 34) 

p(A(*))Q(A'|A«) 1 ' 

then the new proposed value A' is accepted or rejected following this scheme: 

If r > 1 then A (i+1) = A' 

If r < 1 then 
A (t+i) =x i if r < a 

X (t+i) = X (t) if r ^ a (4.35) 

where a is a random number drawn from a uniform distribution. Asymptotically the A^ 
will then tend to have the probability distribution p(X). The benefit of this algorithm is 
that the computation of the normalisation factor in the denominator of Eq. (|4.33|) is not 
needed. Another obvious benefit is that very complex probability distributions can be 
derived, thereby providing the potential correlations between the various parameters Aj. 

The difficulty in the use of the MH is not in the algorithm itself, which is quite easy 
to implement but in the proper choice of the input distr ibution Q. This is a vast subject 



which goes far beyond this course. The reader will find in lGregorvl (|2005l ) what is required 
for delving into the subject of obtaining the posterior probability distribution using 
various techniques: Gibbs sampling, thermal annealing, convergence and so forth. 
Mean, rms and moment estimation. As soon as the posterior probability is known, 
we can derive an estimate of the moment k of the parameter Aj by deriving first the 
posterior probability distribution of Aj only or p(Xi\x, I). This is done by integrating (or 
marginalising) over the so-called nuisance parameters as follows: 

p(X i \x,T)= [ p(A|x,I)dA„ (4.36) 

where A n G with n =/= i. Then we can compute the moments of the posterior probability 
by writing: 



< A? >= J Afp(A l |x,I)dA. i (4.37) 

The first and second moments provide the mean value and rms deviations of Xi as 

A t =< A} > (4.38) 



a Xi = V< Af > -A? (4.39) 
The use of these two moments is enough to describe a normal distribution, as all the 
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other moments can be derived from these two. Here I note that the Cramer-Rao criteria 
is also relevant. It means that the rms value derived above is bounded as follows: 

a 2 Xi > H- 1 ^) (4.40) 

As a result, the error bars derived from the Bayesian approach are larger than those 
returned using MLE. It means that, contrary to popular belief, the Bayesian approach 
is more conservative than MLE. 

If the posterior probability is not normal, higher moments could be quoted. It is rather 
impractical to quote all the moments higher than 2. Instead, we can use the value of the 
median and of the percentiles. We can define, for a random variable x with a probability 
distribution p, two values x\ and X2 such that for a given percentile q we have: 

/XI f + OO 

p(x)dx = / p(x)dx (4-41) 
-OO J X2 

When q = 50%, we have x\ = xi thereby providing the median. Usually for a gaussian 
distribution, the l-cr and 2-a values provide a percentile of 15.9% and 2.3%, respectively, 
i.e. 68.2% and 95.4% of the values are in the range defined by x\ and x%- The use of the 
4 percentiles and the median can be shown in a so-called box plot. The main advantage 
of using percentiles is that they are invariant under a change of variable. In other words, 
when making a change of variable from x to g{x) in Eq. (|4.4ip . the x\ and X2 returned 
are unaffected by the transform (provided that g is a monotonic function of x). 

In practice, when the analytical integration cannot be done, Eq. (|4.37p is computed 
using the MH algorithm mentioned above, such that we have: 

* t 

where N t is the total number of samples computed returned by the MH algorithm. Then 
the median and the percenti les are computed by so rting the values of a[ . Examples of 



such results can be found in Benomar et al.l ( 2009a ) 



Role of the prior. The prior probability expresses what we believe we know (or not) 
about the parameters A,. The choice of the prior is related to the amount of information 
at our disposal. The most obvious prior probability of the parameters A^ of interest is the 
one that is uniformly distributed over some range; this is an uninformative prior. The 
role of the prior and its impact on the posterior probability should ideally be as small 
as possible . Objec tive uninformative priors are derived using the procedure described 



bv I Jeffreys 



19461 ). related to the calculation of the determinant of the Fisher matrix 
( Fisherl 19251 ). An informative prior could be, for example, a gaussian distribution of a 
given parameter A^ . Priors are not always proper in the sense that they are not always 
related to a proper probability distribution, i.e. providing finite moments. The 1/cr prior 
for the unknown rms value of a parameter is a specific example of an improper prior. An 
extensive dis cussion on the impact of the prior in a Bayesian framework has been well 
developed by Ijavnes! ( 1987 ) . 



Significance and model comparison. The power of the Bayesian approach is also 
to be able to compare different models. The approach used by frequentists using the 
likelihood ratio test outlined in the previous section is quite similar to what is called the 
Bayesian odd ratio. Let us assume that one wants to compare between different model 
M„. The odd ratios between any two sets of models is: 

p(M n \x 7 I) _ p(M n \I) p(x\M n ,I) 
Un/m P (M m \x,I) p(M m \I)p(x\M m ,I) [ ■ ° j 
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The second part of the equation was derived from Bayes' theorem. The second fraction 
closely resembles the likelihood ratio presented above, but it is the product of the ratio 
of the prior model probabilities by the ratio of the global likelihood. It is termed global 
because these probabilities are not a point-like estimate (as in the frequentist approach) 
but an integration over all the possible values of the estimated parameters A; . The global 
likelihood p(x\M m , I) is written as: 

P {x\M m J) = f p(X\M m , I)p(x\X, M m ,I)dX (4.44) 

where p(A|M m , I) is the prior probability, and p(x\ A, M m , /) is the likelihood. The compu- 
tation of the global likelihood is rather difficult but can be done using the MH algorithm 
under parallel tempering. The inte gration of Eg. (14.441) ca n then be done using the so- 



called thermodynamic inte gration (German fc Mend 19981) . Applications of this kind of 



integration can be found in iGregorvl (|2005r) 



It is also useful to express the posterior probability of each model p(M„|x, /) as follows: 

T,P{ M m\I)P[x\M m ,I) 

Usually if the model comparison is an objective Bayesian analysis, then all prior model 
probabilities are equal (j>(x\M m , I) = p(x|Mfc,/),V m,k). This assumption can be used 
when the models are strictly different and are not nested. Here nested means that a child 
model relies on a parent model, the former having more parameters describing the model 
that the latter. Most of the models we used are indeed nested, they differ from each other 
by a few parameters. In this case, it results in the so-called model multiplicity that must 
be taken into account under the subjective Bayesian approach. Under this approach, it 
is possible to have model probability differing from each other (p(M m \I) ^ p(Mfc|J)). A 
by-product of the subjective Bayesian approach is also to provide an estimate of these 
p(M m \I) based on the data. Model multiplicity ha s just been started to be taken into 
account in model comparison (IScott fc Bergerl2010t ). Since this is very recent, I advise the 



reader to inform themselves on whether this approach can be useful for model comparison. 



5. Application to asteroseismology 

In the two previous sections, I laid down the foundations for applying harmonic analysis 
and statistics to astrophysics. In particular, the field of asteroseismology is extremely 
relevant for these applications. Stars have been known to oscillate since, at least, the 16 th 
century when David Fabricius found that o Ceti (Mira) was variable. Since then many 
others stars such as Cephei, 5 Scuti, Cephei ds, 7 Dor or solar-like stars have been found 
to oscillate (See IChristensen-Dalsgaard 2004 . and references therein). Stellar oscillations 
are mainly excited by an opa city-driven mechanism (k m echanism) and by turbulence 
occurring in convection zones ( Gautschv fc Saiolll995 . 1996T) . Broadly speaking, the world 
of stellar oscillations can be divided in two categories: 

• Periodic pulsations having a weakly time-dependent amplitude 

• Oscillatory eigenmodes stochastically excited 

The first type results from overstable oscillations in stars being driven by the k mecha- 
nism, whose amplitudes are limited by a non-linear mechanism. The functional form of 
the variation of the luminosity with time is then periodic but not necessarily sinusoidal. 
In that case the amplitude of the oscillations varies more slowly that the periods of the 
oscillations. 
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The second type results from modes being ran domly excited in so lar- like stars whose 
amplitudes are damped by various mechanisms (jHoudek et al. I ll999h . In that case, the 
amplitude of the oscillations varies on time scales shorter than the periods of oscillations . 



Ther e are stars for which the two types of oscillations co-exists ( Belkacem et al. 20091 . 
2010h . 



When applying various tools for obtaining the frequencies of the oscillation (frequencies 
which describe the internal structure of the star) , then one should ask oneself which type 
of functional form the stellar oscillation will have. Typically there are three type of 
functional forms: 

• Periodic non-sinusoidal 

• Sinusoidal 

• Harmonic oscillator stochastically excited 

These functional forms being periodic the Fourier transform is the obvious choice for time 
series analysis. As for the last functional form, the random nature of the excitation im- 
poses the application of a proper statistical treatment of the time series and its associated 
power spectrum. Hereafter, I will treat the two most common cases encountered in aster- 
oseismology: Classical pulsators (periodic non-sinusoidal function), Solar-like oscillators 
(harmonic oscillators) 



5.1. Classical pulsators 

For stars having luminosity variations whose functional forms is sinusoidal, the common 
practice is to use the Fourier transform described in the previous sections. The first 
applic ation of statistic s and Fourier transform was for finding periodicities in earthquakes 
due to ISchusterl (|1897l ). which has since then be coined the Schuster periodogram. 

For variabl e stars (or classical puls ators), the first application of the periodogram is 
attributed to Wehlau fc Leung ( 19641 ). Since that date, the analysis of time series has 
been evolving to take into account various aspects related to the presence of gaps and the 
large dynamic range in the amplitude of the periodicities. One of the problem encountered 
when observing stars from the ground is that the periodic gaps in the observation (due to 
the day-night cycle) introduce aliases. The presence of these gaps produce then spurious 
peaks located on either side of the main peak located at multiple of ±11.57/i Hz (1/24 hr). 
One soluti on for taking into ac count such a frequency response is to apply the CLEAN 
algorithm dRoberts et al. 1987 ) which is used in radioastronomy for aperture synthesis 
( Hogbomlll974l ). Another approach is to use a combinaison of the CLEAN algorithm and 
of prewhitcning which consists in removing the signal of largest amplitude in the time 
series (after having being detected) and then to re-compute the periodo gram; the proce- 



dure iterates until there is no large signal detected in the periodogram (jBelmonte et al 



19911) . This latter technique is now the most commonly used for classical pulsators. 



5.1.1. Spectral analysis revisited. 

Single sine wave. I already touched upon the Fourier analysis of pure sine waves us- 
ing the frequentist framework (application of Least Squares) . Here I shall briefly revisit 
what can be done when using a Bayesian approach. iBretthorsti (|1988l ) , using a Bayesian 
approach to the analysis of the time series of a pure sine wave sampled regularly and 
embedded in noise having a gaussian distribution, demonstrated that the posterior prob- 
ability for the frequency v of the sine wave can be written as: 



\nP(v\x,a, I) 



C(y) 



o~ 



(5.1) 
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where x are the data (xi taken at time U), a is the rms value of the noise assumed to be 
known, and C{v) is the Schuster periodogram given by: 



C(y) 



1 

N 



N 
i=l 



(5.2) 



This is nothing less than the power spectrum. For a pure sine wave with frequency Uq, 
the periodogram has a maximum at that frequency. Using the posterior probability for 
v, we then have the first moment of the frequency as: 



P(v\x, a, Vjdv = i/q 



(5.3) 



Following this approach. I Javnes (1987) demonstrated using Eq. (|5.2p that for a pure sine 
wave of amplitude A, the rms error on the frequency vq is given by: 



5v = 



7r A Ti 



(5.4) 



where T is t he observing ti me and At is the samp li ng tim e. This formula is the same 
as given by ICuvpers (1987) and derived by Koen ( 19991 ). The frequency precision is 
inversely proportional to the signal-to-noise ratio computed in the time domain. This 
equation also shows that the Rayleigh criterion (1/T) for frequency resolution is very 
pessimistic compared to the precision with which frequencies can be measured. 
Many sine waves. The approach outlined above linking the posterior probability to the 
Fourier transform justifies a posteriori the use of that transform: " The highest peak in the 
discrete Fourier transform is an optimal frequency estimator for a data set which contain s 
a single harmonic frequency in the presence of Gaussian white noise" , lBretthorstl(|l988l ). 
Fortunately, for several harmonic signals whose frequencies (i/j) are far from each other, 
the naive use of Fourier can be broken down in the application of the preceding section 
as many times as required or: 



h\P(vi, ...h> r \x,a,T) = 



(5.5) 



This justifies the use of the periodogram for finding harmonic signal in a time series 
( Bretthorstlll988l) . 

Periodic signals. When the signal is periodic but not sinusoidal, the decomposition 
using the periodogram will provide the fundamental Vf of the period and spurious fre- 
quencies at nvf. The presence of these spurious frequen cies can become dif ficult to handle 
when there are a lot of frequencies such as in 8 Scuti ( Poretti et al. 20091 ). The periodic 
signal can also be the result of non-linear saturati on which cannot be easily described 
in terms of harmonic signals ( Yoachim et al. 20091 ). As mentioned above, prewhitening 
together with the use of the CLEAN algorithm is a possible solution for analysing such 
periodic signals. The analysi s can also be perfor med using other techniques such as Prin- 
cipal Components Analysis ( Tanvir et al. l2005h . 



I showed in a previous section how one could analyse data, that are not equally sampled 
in time, using the Lomb-Scargl e periodogram. Th e LS periodogram has been extended 
not only to a decaying sinusoid (|Bretthorstll2001al) but also to periodic functions in gen- 
eral ( Bretthorst] 2001bT) . This revisitation of spectral analysis by Bretthorst is extremely 
useful when one want s to unders t and ho w to apply a Bayesian analysis to time series. 
Based on the work of iBretthorstl (|2001bl ). a possible application would be to model the 
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amplitude limitation 6 Scuti to obtain the functional forms of the periodic signal. A dif- 
ferent functional form from a sinusoid could be u sed as an advantag e for disentangling 
the various possible combinations of frequencies ( Poretti et al.l 120091) either due to the 



gaps, the periodic signal or the physics of the stars. This is an avenue yet to be explored. 

5.2. Solar-like oscillations 

The analysis of solar-like oscillations is slightly more complicated because of the random 
excitation of the modes. In order to apply the Fourier transform and use statistical 
tools for extracting mode frequencies and mode parameters for such stars, one has to 
understand how the functional form of the mode amplitude is related to the harmonic 
oscillator being randomly excited. 

5.2.1. Randomly excited harmonic oscillations 

It is well known th at pressure modes (or p modes) are stochastically excited oscillators 
( Kumar et al 1 1l988h . The source of excitation lies in the many granu les covering the 



star (Turbulent convection. iHoudek et al. 19991 Samadi fc Goupill 2001 ). The modes are 



assumed to be independently excited (|Kumar et al.lll988h because the eigenfunction of the 
modes is primarily radial and nearly independent of degree in the upper convection zone, 
where the modes are excited. The eigenmodes are harmonic oscillators being intrinsically 
damped, and excited through a forcing function F. The differential equation of such a 
damped harmonic oscillator is written as: 

f^l + 2 , 7 ^£ + (2nfv 2 x osc = F(t) (5.6) 

where t is the time, x osc is the displacement, 7 is the damping term related to the mode 
linewidth, v$ is the frequency of the mode and F(t) is the forcing function. Equation (|5.6|) 
is also the expression of an auto-regressive (AR) process which in this case is a stationary 
process. From this equation the Fourier transform of x can be written as: 

iosc(^) = JZ so/ 2 9 I ■ \ (5-7) 

(27r) 2 (z/jJ - v 2 + i^v) 

where x osc {y) and F(y) are the Fourier transform of x asc {t) and F(t). From the large 
number of granules, it can be derived that the forcing function is normally distributed. 
Therefore the 2 components (the real and imaginary parts) of the Fourier transform of 
the forcing function are also normally distributed (See Section l37Tj) . Therefore, for the 
harmonic oscillator, each component of x osc (is) is normally distributed with a mean of 
zero, and the same variance. The variance is related to the spectral density given by: 



S F {y) 

(27T) 4 [(^_^2)2 + l/ 2 7 2 

The spectral density of x osc {v) has then a x 2 with 2 degrees of freedom statistics. For 
such statistics, I outline that the variance is the same as the mean. Equation (|5.8p is the 
p-mode profile that is usually approximated by a Lorentzian profile when v « vq as: 

1 2 (2tt) 4 ^1 + x 2 

with x = 2(y — fo)/7- An asymmetry effect can also be introduced in the profile of 
Eq. (|5.9[) . T he asymmetry effec t was first detected with instruments making an image 



««^£oTT3 (5-9) 



• y ertec 

of the Sun (iDuvall et al.ll993h . The asymmetry is due to the location of the excita- 



tion of the modes with respect to the resonant cavity. There is a direct analogy with 
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a source being place d inside a Fabry-Pero t cavity ( Duvall et a l. 1993). The asymmetry 
was then detected bv lToutain et al. ( 1997 ) with instruments observing the Sun as a star; 
the sign of the asymmetry depending upon the observables (intensity of velocity). The 
empirical theoretical ex planation for the di fferent asymmetry in intensity with respect 
to velocity was given bv lNigam et al. I (|l998l) . It is related to the different correlation be- 
tween the backgroun d and the modes in velocity and in intensity which were studied by 
Severino et al.l (|2001l ). The theoretical framework pr ovided then a simple expres sion for 
the expression of the mode profile with asymmetry (jNigam fc Kosovichev 1998 ). given 
by: 

S osc {p) = H l -±^- + B 2 (5.10) 
1 + 

where B is the asymmetry effect and H is the mode height. The effect of the linear slope 
is to change the sign of asymmetry when the sign of B changes. When B is positive 
(negative) there is more power at high (low) frequency. Therefore, if the asymmetry is 
not taken into account, the sign of B will affect the true location of the mode frequency 
differently. This effect, if not taken into account, may provide large systematic errors when 
comparing different observables thereby providing potential problems for the inferred 
model of the star. 

5.3. Random non-harmonic field 

Another source contributing to the observed power spectrum is the noise generated by 
the star itself. It is quite customary to have the stellar noise increasing at low frequencies. 
As a matter of fact, the noise observed is not limited to stars but more related to an 
intrinsic behaviour encountered in many physical phenomena. This so-called 1// noise 
is so ubiquitous that it is found in almost any physical measurements and also in stars. 
The 1// noise appears in many electrical applications and other applications (jKeshner 
1982). While white noise is known not to have any memory (autocovariance is the Dirac 
distribution), the 1// noise possesses some memory. Mathematically, AR models have 
also these properties. Random processes following AR models have the desired properties 
of having me mory, thereby producing l//-like power spectra. This fact was used by 
Harvevl (|l985h for deriving the solar background noise spectrum for instruments observing 
solar radial velocities. The spectrum is derived from a superposition of four components 
related to s olar act i vity a nd to three different types of granulation, having different 
lifetimes. In Harvev ( 1985|). the power law used had a -2 slope; it was later refined to be 
arbitrary b ( Harvev et al.Fl993h such that we have for the spectral density of background 
noise x n (t): 

w=E 1+(2 t^ ( 5 - n ) 

where r, is the characteristic time of the decaying autocorrelation function of the process, 
Hi is the height in the power spectrum at v — 0, and bi is the power law exponent (with 
the bi ranging typically from 2 to 6 for intensity measurements; lAppourchaux et al 



20021: lAigrain et al.ll2004l ). It is clear that Eq. (|5. 1 1|) also has a Lorentzian-like shape as 
much as the harmonic oscillator of the previous section. Therefore I also outline that 
modes stochastically excited since they also follow an AR model, have also the same 
mathematical property as the random non-harmonic field. 

5.4. Power spectrum model 

Stationary processes. I showed in Section 13771 that stationary processes have indeed a 
X 2 with 2 d.o.f statistics. The question is: what is the mean value when different stationary 
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processes are in play? Let us assume that two time series for two different processes co- 
exist: x osc {t) related to the excitation of the harmonic oscillators (the eigenmodes), x n (t) 
related to the non- harmonic noise (the background stellar noise). The total time series 
is then given by: 

x(t)=x osc (t)+x n (t) (5.12) 
and the autocorrelation is given by: 

E[z(fi)x(t 2 )] = C osc {t) + C n (r) + \E[x OBe (ti)x a (t 3 )] + E[x n (ti)x osc (t2)]] (5.13) 

with r = ti — t\ and where C sc, C n are the autocorrelation for the harmonic and non- 
harmonic signals. The last term in brackets is related to the potential correlation between 
the two type of stationary processes. Usually, we assume that there is no correlation be- 
tween these processes such that the term in brackets is zero. Using the Wiener-Khinchine 
theorem, we have for the spectral density: 

S x (v) = S osc (u) + S n (y) (5.14) 

So the spectral density of the sum of two independent stationary process is the sum of 
the spectral density of each stationary process. This implicit assumption, not frequently 
mentioned, is usually a good approximation for solar-like oscillations. If we assume that 
the processes are correlated but that their correlation is stationary we can then write: 

S x (u) = S osc (v) + Sn(v) + I{y) (5.15) 

where I{v) is simply the Fourier transform of the expression in brackets of Eq. (|5.13j) . 
Su ch correlations betwee n the background and the harmonic oscillators have been studied 
bv lSeverino et ah ( 2001). The term I is resp onsible for the mode profile asymmetry as 



By loevermo et ai. llzuuil). i ne term l is resp 
explained above ( Nigam fc Kosovichevlll998| ). 



Model of the spectrum. The complete model of the power spectrum is given by the 
superposition of all the harmonic signals and of the non-harmonic signals. Apart from 
the Sun, so far we have only observed disk-integrated oscillations in stars. Instruments 
integrating over the stellar surface observe the velocity or the intensity signal as a super- 
position of various modes of different degrees. The impact of this integration is to filter 
out the higher degree modes, keeping typically only the degrees below 4. The resulting 
mo de sensitivity (Vi) was first computed for velocity measurements for non-rotating stars 
by IChristensen-Dalsgaard fc Goughl (|l982h . then by IChristensen-Dalsgaardl (|1989l ) tak- 
in g into account Doppler imagin g due to rotation; for intensity, it was first computed 



IS 

bv lToutain fc Gouttebrozd (J1993). The impact of t he inclination angle of the star upon 



the mode sensitivity (q. to ) was also co mputed by iToutain fc Gouttebrozd (|1993l) . cal- 
culation which was later re-discovered by iGizon fc Solankil (|2003h . The mode sensitivit y 



Vi has also been extensiv ely c omputed by differe nt authors; e.g lBedding et al.l ()199(: 
Appourchaux et al.l (|2000h and iBallot et alJ (|2006h . 



Using the equations given above and taking into account the geometrical mode sensi- 
tivity, the full spectral density is therefore given by: 

S x (u) = £ H n V t \ m L f^J^tA +V f ft (5-16) 

n,Lm 



where H n is the mode height for radial order n, L is the mode profile, T n i m is the inverse 
mode lifetime, v n im is the mode frequency; and the Hi, t% and bi represent the stellar 
and instrumental background noise (including also the photon noise). The mode profile 
is here assumed to be Lorentzian. It can be replaced by an asymmetrical profile but at 
the expense of making an incorrect approximation of the correlation effect /. If one wants 
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to pro perly take into account this correlation effect, one should refer to ISeverino et al 
( 200lh . 



Several effects will lift the degeneracy of the mode frequency v n \ such as stellar rota- 
tion or stellar activity, thereby splitting the frequency of the mode in different compo- 
nent (Zeeman-like effect). In this case, the mode frequency can be decomposed into the 
Clebsch-Gordan coefficient as: 



Y^ai(n,l)lP^(m/l) (5.17) 



where i max ^ 2Z, and ai{n, I) are the splitting coefficients and V\ % ^ are polynomials given 



bv iRitzwoller fe Lavelvl (|l99ll ) or by IChaplin et all (|2004) . The advantage of the vf* 



is that they are by construction orthogonal to each other. With this property, the max 
i for the decomposition can be different from 21, in other words removing or adding 
polynomials will not affect the value of the ai(n, I). To second order, this property may 
not be completely true as there could be slight variations in the mode height due to 
stochastic excitation or other effects not modelled. For example, the observation of the 
integrated signal over the stellar surface with a star observed perpendicular to its rotation 
axis (i = 90°) will remove modes for which I + m is odd. If this averagi ng effect is not 



taken into account the decomposition will affect the value of the a^(n, /) (|Chaplin et al 
20041) . 



Depending on the model used, different assumptions concerning the mode and back- 
ground parameters are possible. For instance, the mode linewidth can be assumed to be 
independent of frequency, or to vary with frequency according to some polynomials, or 
to be independent for each order n, or to be independent for each degree I, and so forth 
and so on. It is sometimes desirable to reduce the number of fitted parameters using such 
assumptions. Examples of such assumptions are found in the references given in the next 
two sections. 

5.5. Frequentist parameter estimation 

Now having observed a star, we have a time series of intensity or velocity fluctuations 
which is properly sampled, and for which the filtering effect of the data reduction is un- 
derstood. We compute the Fourier transform and then obtain the power spectral density 
properly scaled using Parseval's theorem. The amount of gaps is negligible such that the 
frequency bins are independent of each other. The statistics of the power spectrum is \ 2 
with 2 d.o.f. Assuming that all the stationary processes are nearly independent, we have 
a complete model of the power spectrum including harmonic and non-harmonic signals. 

Now that I have set the scene, all the actors are in place for the play. The method 
applied for deriving all the parameters of Eq. (|5 . 1 6[) is usuall y to use MLE. Th e use o f 
MLE for fitting solar power spectrum was fi r st me ntioned by iDuvall fe Harvevl (|l986l ). 



then applied and tested bv lAnderson et aL (199(3); m which they also mentioned the 



use of Monte-Carlo simulations for validating the method. Their pioneering work led in 
helioseismology t o the understanding of the der ivation of error bar for mode frequencies 



( Libbre cht 1992c iToutain fc Appourchauxlll994[) which is given by 



This equation shows that the mode precision is proportional to the square root of 
the mode linewidth and is proportional to the square root of the frequency resolu- 
tion. This results greatly contrasts with that of a pure sine wave given by Eq. (|5.4j) . 
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Similarly other error bars wer e also derived for li newidth, mode height, white noise 

dTout ain fc Appourchauxl 19941) : rotational s plitting dToutain fe Appourchauxlll994HGizon fc Solankil 
20031: ballot et al.ll2008l ): inclination angle (jGizon fc Solankil l2003t iBallot et al.l l2Q08h ; 



and of the correlation between the mode height, the linewidth and background noise 
( Toutain fc Appourchaux 19941) (See Appen dix A); and between the rotational splitting 
and the inclination angle ([Ballot et al. I l2008l) . 

Monte-Carlo simulations have also been used for verifying the veracity of using the in- 



verse of the Hessian for estimating error bars on the parameter s of the model ([A nderso n et al 



1990t iToutain fc Appourchaux! Il994l : ISchou fc Brownl Il994l : lAppourchaux et al.l Il9~ : 
Gizon fc Solankill2003l) (See Section . When doing so, it was patent that the distri- 
bution for mode height, linewidth and background noise was not normal but log-normal. 
There are two main reasons for this state of affairs: first, these simulations are far from 
the asymptotic behaviour, otherwise the distribution would be normal; second, the math- 
ematical transform log(a;) allows us to have a distribution closer to a normal distribution 
than x alone. In other words, the asymptotic behaviour is reached faster with such a 
transform. The same would apply to any transform reducing the amount of value larger 
than the median with respect to value smaller than the mediarjj 

Although the MLE has been in use for more than 20 years by helioseismologists, it 
is not yet completely adopted by our fellow asteroseismologists. The body of evidence 
provided by helioseismology is rather large, hereafter I will necessarily refer to a few exam- 
ples: IPHIR (Interplanetary Helioseismology by IRr adiance measurements) ph otometers 
aboard the Russian mission Mars94 (Sun as a star; Toutain fc Frohlicb 19921 ). VIRGO 
(Variability of solar IRradiance and Gravity Oscillations) photometers aboard SoHO 



(Sola r and Heliospheric Observ atory) (Sun as a star and low resolution; iFrohlich et al 



Il997t lAppourchaux et al.l Il997h . GOLF (Global Oscillati ons at Low Frequency) spec 
trometer aboard SOHO (Sun as a star, iGellv et al.ll2002l). BiSO N (B irmingham Solar 
Oscillations Networ k) instrument (Sun as a star IChaplin et ah 19961 ). LOWL (Low-Z) 
instrument (imager: ISchoulll992l: ISchou fc Brownlll994l). GONG (Global Osci llation Net- 
work Group) instrument (imager; Hill et all 1996t Appourchaux et al. 19981 ). SOI/MDI 
(Solar O scillations Investigation / Michelson Doppler Imager) instrument aboard SoHO 
(imager ; iKosovichev et al.lll997 ) . 
Following these successful applications for solar data, MLE was preliminarily tested 



on syn thetic stellar timeseries for use on the CNES C0R0T mission ([Appourchaux et al 



2006allbT l. The recipe lai d down in this latter publica tion was found later not be applicable 
to the real C0R0T data. lAppourchaux et al.l (120081) introduced a s cheme using not fit on 
narrow frequency window ( local fit as in lAppourchaux et al.ll2006al ) but using fit on large 
frequency window (global fit). This new scheme has now been applied to solar-like stars 
and red giants observed by C0R0T (iGarria et al. 20091 : Barban et al. 20091 : Mosser et al 



20091: iDeheuvels et al.ll2010l : iBaudin et alj|201ll ): and to solar-like stars o bserved by the 



NAS A (National and Aeronautics Space Administration) Kepler mission (|Mathur et al 
201 lh . 

One of the recent controversies in asteroseismology is relate d to the degree identif i cation 



in the power spectra of HD49333 (observed by C0R0T). In Appourchaux et al. ( 20081 ) 



the degree of the modes could not easily identified leaving two possible solutions: the 
ridges in the echelle diagram were either I = — 1 or I = 1 — 0. The rea son for such a 
difficult identification, already anticipated by lAppourchaux et al.l (|2006al) . is related to 



a mode linewidth larger than in the Sun combined with a narrower small spacing than 



f The reader may convince themselves by simulating a \ 2 with n d.o.f for example together 
with a log x or ^Jx transform. 
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in the Sun. As a result, the two ridges I = — 2 and I 



each other. One of the solutions found by lAppourchaux et al 



1 wer e indis tinguishable from 
(|2008h was to base the 



ch oice of the ridge up o n the likelihood ratio test. Here I would like to cite a warning 



Appourchaux et al.l ( 2008 ) as to the use of this test: "However, it is important to 



remember that a higher likelihood does not mean that a given model is physically more 
meaningful, rather, it means the model is statistically more likely." It was clear at this 
stage that in order to take into account prejudices and a priori knowledge that could 
have helped to get the physically meaningful model, a Bayesian approach to the problem 
was required (ibid). This is the subject of the next section. 

5.6. Bayesian parameter estimation 

In asteroseismology , the first attempt at applying a Bayesian approach was due to 
iBrewer et all (|2007l ). They used a model based upon pure sine waves which did not 
correctly model stochastic excitation of harmonic oscillations. Their model can be used 
for classical pulsators or when the modes ha v e a li fetime longer than the observation 
time. Following the att empt by IBrewer et al. ( 2007 ). I realised that the approach and 
framework provided by Bretthorstl (|l988j ) is not generally applicable in asteroseismol- 
ogy. The separation in time between the deterministic signal and the noise cannot be 
done when oscillators are stochastically ex cited. Since th e sepa ration is not possible, the 

needs to be replaced by 
. At the time, it was not 



statistics of the stellar signal suggested b y IBrewer et al 



the likelihood usually used by frequentist lAppourchaux 
clear that the replacement wa s so obvious 




Following the early work of lAppourchaux! (|2008f ). the application of Bayesian analysis 
to solar-like oscillators has been quite extensively developed. The difficult data analysis 
of H D49333 led to the development of several Bayesi an data analyses in asteroseismol- 



ogy (jBenomar et al. 2009at iGruberbauer et al 20091 ). A simplified Bayesian approach 
called Maxim um A Posteriori (M AP) has al so been used for putting constraints on the 



mode height ( Gaulme et al. 20091 ). Recently. iHandberg fc Campante ( 2011 ) produced a 
very nice guide for anyone wanting to do Bayesian data analysis in asteroseismology. A 
more extensive guide in French is also available in Benomar ( 2010( ). These two guides 
covers many aspect of the application of Bayes' theorem such as power spectrum mod- 
els, statistics, priors, parallel tempering and global likelihood. Here I should add that the 
Bayesian framework has been used not only for parameter estimation but also for making 
decis i on based upon posterio r probability estimates: mode detection (lAppourchaux et al 
20091 : iBroomhallet al.llioiol) . presence of I = 3 modes and mixed modes (jDeheuvels et al 
201 



As anticipated by lAppourchaux et al. I (l2008l). the H D49333 contr oversy led to the first 
useful ap plication of the Bayesian a pproach. iBenomar et al.1 (|2009al ) using the same time 

globa l 



series as 



Appourchaux et al.l (|2008j ) applied a Bayesian analysis and derived th e 



likelihood of the various models previously used. In doing so. IBenomar et aT ( 2009a ) 



showed that amongst the four models used the identification of lAppourchaux et al.l (|2008l) 
was the most probabl e one but for a model ha v ing no I = 2 modes ! When these lat 



ter are included (as in Appourchaux et al. 20081 ). Benomar et al.l ( 2009a ) found that the 



identification of lAppourchaux et al.l ( 20081 ) was less probable (14%) than the other identi 



fication (23%); thereby showing that both models were n early equiprobable. The ab sence 
of I = 2 modes in HD49333 was also the conclusion of Gruberbauer et all d2009f) who 



also u sed a Bayesia n approach. The main d ifferences of the work of IGruberbauer et al 



( 20091 ) from that of Benomar et al. ( 2009a ) are that they do not use global likelihoods 
for decision and that their model has no splitting (no degree identification possible). 
The controversy was finally closed when 180 days of data were available for HD49333. 
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Using these data, Benomar et al.l ( 2009bl ) showed that both frequentist and Bayesian ap- 
proach es favour a model haying I = 2 modes with a degree identification different from 



that of lAppourchaux et al 



(120081) . 

What lessons can we draw from such a controversy? The process of the advancement 
of science cannot be done without making errors. The can be of several kind: 

• Done on purpose for hiding the facts (a lie) 

• Done by ignorance of the facts 

• Done in knowledge of the facts 

The first kind of error has lead to several infamous publication such as a cold fusion 
(jFleischmann fc Pons! fl989h finding which was never reproduced. The second kind of 
error is due to lack of information or knowledge; in this world of information flow this 
can easily happen to anyone of us. The third kind can either be perceived as an error 
from an outsider or as a dec i sion d one in full conscience of what has been decided. The 

(|2008l ) is obviously akin to the last kind. If we would have 



error of lAppourchaux et al.l 
had the choice, we would have certainly opted for applying the Bayesian approach in 



Appourchauxl ((.2008) but time prevented us to do so. This is another lesson to be learnt: 



even in this fast paced world of information flow and publications, one should never trade 
quality of publication and scientific integrity for fast publication. Science is like pasta, a 
slow sugar. 



6. Conclusion 

If you have been able to survive until now, I would like to thank you. It would be 
pretentious to believe that this Conclusion is going to conclude anything for good. As 
a matter of fact, this conclusion aims at giving the reader the incentive to go beyond 
this course but still using this course as a foundation for the future. Data analysis clearly 
starts from the instrumentation itself, this aspect was not treated here. It is clear that the 
measurement leading to the analysis of time series is done through an instrument. From 
this point of view, the aspects related to the filtering due to the instrument are implicitly 
laid out. Apart from these mere details, the reader should find all that is needed for a 
proper understanding of data analysis in asteroseismology. Needless to say, that this field 
is evolving very rapidly, thanks to the CoRoT and Kepler missions. In the last couple 
of years, the Bayesian data analysis has developed quickly. Despite what is commonly 
thought, the Bayesian approach although using prejudices is in the end more conservative 
than the frequentist approach. Actually, the main drawback of this approach is not in 
the methodology itself but the speed at which the computation of the results can be 
performed. 

With the advent of the new PLATO mission, it can be anticipated that both Bayesian 
and frequentist data analysis will be used for deriving the mode parameters of more 
than 20,000 stars. I can think of a hybrid scheme where both approaches will be used 
depending on the signal-to-noise ratio in the power spectrum. Another aspect which was 
partly discussed here is how one can assess and insure that the fitted mode parameters 
are of usable quality for doing stellar models. This is certainly to be one of the future 
challenges of asteroseismology: the automatic assessment of stellar mode parameters. I 
can also anticipate that we will apply quality assessment which are regularly used in 
factories doing mass production. For instance, sampling at random some of our output 
products may be useful for qualifying the lot as useable by the scientific community. We 
are really in the infant stage for such massive quality assessment. 

The future of asteroseismology will reach another milestones when the surface of stars 
will be imaged. At that point time, we will then be able to probe the internal and 
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dynamics structure of stars in the very same manner as we did with the Sun. The Stellar 
Imager could be the next asteroseismic challeng e whose achievement may be not so far 
in the future ( Christensen-Dalsgaard et al.ll2011l) . 



As for any new endeavour, we need to learn from the past for going ahead while keeping 
in mind Timothy Leary's motto: Think for yourself, question authority. 

Acknowledgments. I would like to thank Pere Palle for inviting me to give these lectures 
in the beautiful island of Tenerife. I would also like to express my thanks for the many 
discussions I had with my long standing colleagues from which this course benefited: 
Frederic Baudin, Othman Benomar, Patrick Boumier, William Chaplin, Patrick Gaulme, 
Laurent Gizon and Takashi Sekii. This course also benefited from the acute reading of 
John Leibacher. Last but not least, many thanks to my wife, Maryse, and my sons, Kevin 
and Thibault, for their indefectible support; without forgetting the mention of the purring 
cat, Myrtille. Thanks to the volcano of El Teide for momentary time support, going up 
there was not an easy ride ! 



Appendix A. Correlation between mode height and linewidth 

iToutain fc Appourchaux ( 1994) showed that some fitted mode parameter were intrin- 
sically correlated. When one wants to derive the mode amplitude A (proportional to the 
square root of TH) one should take into account the correlation between the errors on 
r and H which are the mode linewidth the mode height, respectively. Since we have 
A oc TH , and using the logarithm of these quantities, we have: 

logA=-(Iogr + logfl r ) + ... 



or simply 



The error bar on a is derived from: 



-{l + h) + 



1 



2E(ah)) 



which can be rewritten using the correlation coefficient as: 

o 1 



Oft 



(Al) 



(A 2) 



(A3) 



(A 4) 



Using the work of IToutain fc Appourchauxl (|1994[ ). and after some mathematical manip- 
ulation, I can derive for a single mode the correlation as: 



P = 



E(7h) 

0~hO~~, 



(A 5) 



where /3 is the noise to the mode height ratio in the power spectrum. This correlation is 
negative, and its absolute value always greater than 76.5% when (3 < 1. It means that 
even when the modes are easily detected the correlation is always very large and never 
negligible. Finally the error bar on a is then given by: 

a a = —!=^(y/pTT+y/p) ^( v ^TT+ v ^) 3 -4(l + /3)^ (A6) 
where T is the observation time. From this equation, I can deduce that the precision 
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on the measurement of the mode amplitude increases with the observation time and the 
mode linewidth. 
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